$\ell_1$-norm Penalized Orthogonal Forward Regression

Xia Hong, Sheng Chen, Yi Guo, and Junbin Gao 


Abstract: An $\ell_1$-norm penalized orthogonal forward regression
($\ell_1$-POFR) algorithm is proposed based on the concept of the
leave-one-out mean square error (LOOMSE). Firstly, a new $\ell_1$-norm
penalized cost function is defined in the constructed orthogonal
space, and each orthogonal basis is associated with an individually
tunable regularization parameter. Secondly, due to orthogonal
computation, the LOOMSE can be analytically computed
without actually splitting the data set, and moreover a closed
form of the optimal regularization parameter in terms of minimal
LOOMSE is derived. Thirdly, a lower bound for the regularization
parameters is proposed, which can be used for robust LOOMSE
estimation by adaptively detecting and removing regressors to
an inactive set so that the computational cost of the algorithm
is significantly reduced. Illustrative examples are included to
demonstrate the effectiveness of this new $\ell_1$-POFR approach.

Index Terms: Cross validation, forward regression, leave-one-out
errors, regularization


I. Introduction 

One of the main aims in data modeling is good generalization,
i.e. the model's capability to approximate accurately
the system output for unseen data. Sparse models can be
constructed using the $\ell_1$-penalized cost function, e.g. the basis
pursuit or the least absolute shrinkage and selection operator
(LASSO). Based on a fixed single $\ell_1$-penalized regularization
parameter, the LASSO can be configured as a standard
quadratic programming optimization problem. By exploiting
the piecewise linearity of the problem, the least angle regression
(LAR) procedure was developed for solving the problem
efficiently. Note that the computational efficiency in LASSO
is facilitated by a single regularization parameter setting. For
more complicated constraints, e.g. multiple regularizers,
cross validation by actually splitting data sets as the means
of evaluating model generalization comes with a considerably
large overall computational overhead.

Alternatively, the forward orthogonal least squares (OLS)
algorithm efficiently constructs parsimonious models.
Fundamental to the evaluation of model generalization capability
is the concept of cross validation, and one commonly
used version of cross validation is the leave-one-out (LOO)
cross validation. For linear-in-the-parameters models, the
LOO mean square error (LOOMSE) can be calculated without
actually splitting the training data set and estimating the
associated models, by making use of the Sherman-Morrison-Woodbury
theorem. Using the LOOMSE as the model term
selection criterion to seek model generalization, an efficient
orthogonal forward regression (OFR) procedure has been
introduced. Furthermore, the $\ell_2$-norm based regularization
techniques have been incorporated into the OLS
algorithm to produce a regularized OLS (ROLS) algorithm that
carries out model term selection while simultaneously reducing
the variance of the parameter estimates. The analytical optimization
of the $\ell_1$-norm regularizer with respect to model generalization
is, however, less studied.

In this contribution, we propose an $\ell_1$-norm penalized OFR
($\ell_1$-POFR) algorithm to carry out the regularizer optimization
as well as model term selection and parameter estimation
simultaneously in a forward regression manner. The algorithm
is based on a new $\ell_1$-norm penalized cost function with
multiple $\ell_1$ regularizers, each of which is associated with an
orthogonal basis vector by orthogonal decomposition of the
regression matrix of the selected model terms. We derive a
closed form of the optimal regularization parameter in terms
of minimal LOOMSE. To save computational costs, an inactive
set is used along the OFR process by predicting whether any
model terms will be unselectable in future regression steps.

II. Preliminaries

Consider the general nonlinear system represented by the
nonlinear model:

$$y(k) = f(\mathbf{x}(k)) + v(k), \qquad (1)$$

where $\mathbf{x}(k) = [x_1(k)\ x_2(k) \cdots x_m(k)]^T \in \mathbb{R}^m$ denotes the
$m$-dimensional input vector at sample time index $k$ and $y(k)$ is
the system output variable, while $v(k)$ denotes the
system white noise and $f(\cdot)$ is the unknown system mapping.

The unknown system (1) is to be identified based on an
observation data set $D_N = \{\mathbf{x}(k), y(k)\}_{k=1}^{N}$ using some
suitable functional which can approximate $f(\cdot)$ with arbitrary
accuracy. Without loss of generality, we use $D_N$ to construct
a radial basis function (RBF) network model of the form

$$\hat{y}^{(M)}(k) = \hat{f}^{(M)}(\mathbf{x}(k)) = \sum_{i=1}^{M} \theta_i \phi_i(\mathbf{x}(k)), \qquad (2)$$

where $\hat{y}^{(M)}(k)$ is the model prediction output for $\mathbf{x}(k)$ based
on the $M$-term RBF model, and $M$ is the total number of
regressors or model terms, while $\theta_i$ are the model weights.
The regressor $\phi_i(\mathbf{x})$ is given by

$$\phi_i(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{c}_i\|^2}{2\tau^2}\right), \qquad (3)$$

in which $\mathbf{c}_i = [c_{1,i}\ c_{2,i} \cdots c_{m,i}]^T$ is known as the center
vector of the $i$th RBF unit and $\tau$ is an RBF width parameter.
We assume that each RBF unit is placed on a training data point,
namely, all the RBF center vectors $\{\mathbf{c}_i\}_{i=1}^{M}$ are selected from
the training data $\{\mathbf{x}(k)\}_{k=1}^{N}$, and the RBF width $\tau$ has been
predetermined, for example, using cross validation.
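The construction of the candidate regressors can be illustrated with a minimal sketch. It assumes the Gaussian form $\exp(-\|\mathbf{x}-\mathbf{c}_i\|^2/(2\tau^2))$ for the regressor and places one center on every training input; the function name `rbf_design_matrix` and the synthetic inputs are illustrative only and do not come from the paper.

```python
import numpy as np

def rbf_design_matrix(X, centers, tau):
    """Return the N x M matrix with entries exp(-||x(k) - c_i||^2 / (2 tau^2))."""
    # Squared Euclidean distance between every input and every center.
    diff = X[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))              # N = 6 synthetic samples, m = 3 inputs
Phi = rbf_design_matrix(X, X, tau=2.5)   # centers on the training inputs, so M = N
```

With the centers taken from the training inputs, the matrix is square and symmetric with a unit diagonal, and each column is one candidate regressor.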

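Let us denote $e^{(M)}(k) = y(k) - \hat{y}^{(M)}(k)$ as the $M$-term
modeling error for the input data $\mathbf{x}(k)$. Over the training data
set $D_N$, further denote $\mathbf{y} = [y(1)\ y(2) \cdots y(N)]^T$,
$\mathbf{e}^{(M)} = [e^{(M)}(1)\ e^{(M)}(2) \cdots e^{(M)}(N)]^T$, and
$\boldsymbol{\Phi}_M = [\boldsymbol{\phi}_1\ \boldsymbol{\phi}_2 \cdots \boldsymbol{\phi}_M]$
with $\boldsymbol{\phi}_n = [\phi_n(\mathbf{x}(1))\ \phi_n(\mathbf{x}(2)) \cdots \phi_n(\mathbf{x}(N))]^T$, $1 \le n \le M$.
We have the $M$-term model in the matrix form of

$$\mathbf{y} = \boldsymbol{\Phi}_M \boldsymbol{\theta}_M + \mathbf{e}^{(M)}, \qquad (4)$$

where $\boldsymbol{\theta}_M = [\theta_1\ \theta_2 \cdots \theta_M]^T$. Let an orthogonal decomposition
of the regression matrix $\boldsymbol{\Phi}_M$ be

$$\boldsymbol{\Phi}_M = \mathbf{W}_M \mathbf{A}_M, \qquad (5)$$

where

$$\mathbf{A}_M = \begin{bmatrix} 1 & a_{1,2} & \cdots & a_{1,M} \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & a_{M-1,M} \\ 0 & \cdots & 0 & 1 \end{bmatrix} \qquad (6)$$

and

$$\mathbf{W}_M = [\mathbf{w}_1\ \mathbf{w}_2 \cdots \mathbf{w}_M] \qquad (7)$$

with columns satisfying $\mathbf{w}_i^T \mathbf{w}_j = 0$, if $i \ne j$. The regression
model (4) can alternatively be expressed as

$$\mathbf{y} = \mathbf{W}_M \mathbf{g}_M + \mathbf{e}^{(M)}, \qquad (8)$$

where the 'orthogonal' model's weight vector $\mathbf{g}_M = [g_1\ g_2 \cdots g_M]^T$
satisfies the triangular system $\mathbf{A}_M \boldsymbol{\theta}_M = \mathbf{g}_M$,
which can be used to determine the original model parameter
vector $\boldsymbol{\theta}_M$, given $\mathbf{A}_M$ and $\mathbf{g}_M$.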
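The decomposition and the triangular system can be checked numerically. The sketch below uses a modified Gram-Schmidt pass to build an orthogonal-column matrix and a unit upper triangular matrix, then recovers the original weights by solving the triangular system; the function name `orthogonal_decomposition` and the random data are illustrative assumptions.

```python
import numpy as np

def orthogonal_decomposition(Phi):
    """Modified Gram-Schmidt: Phi = W A, W with orthogonal columns, A unit upper triangular."""
    N, M = Phi.shape
    W = Phi.astype(float).copy()
    A = np.eye(M)
    for n in range(M):
        wn = W[:, n]
        for j in range(n + 1, M):
            # Projection coefficient of the j-th (partly orthogonalised) column on w_n.
            A[n, j] = wn @ W[:, j] / (wn @ wn)
            W[:, j] -= A[n, j] * wn
    return W, A

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 4))
y = rng.normal(size=20)
W, A = orthogonal_decomposition(Phi)
g = (W.T @ y) / np.sum(W * W, axis=0)   # least squares weights in the orthogonal space
theta = np.linalg.solve(A, g)            # solve A theta = g for the original weights
```

Because the columns of `W` are orthogonal, each entry of `g` is computed independently, and `theta` coincides with the ordinary least squares solution for `Phi`.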

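Further consider the following weighted $\ell_1$-norm penalized
OLS criterion for the model (8):

$$L_e(\boldsymbol{\Lambda}_M, \mathbf{g}_M) = \|\mathbf{y} - \mathbf{W}_M \mathbf{g}_M\|^2 + \sum_{i=1}^{M} \lambda_i |g_i|, \qquad (9)$$

where $\boldsymbol{\Lambda}_M = \mathrm{diag}\{\lambda_1, \lambda_2, \cdots, \lambda_M\}$, which contains the local
regularization parameters $\lambda_i \ge \varepsilon$, for $1 \le i \le M$, and $\varepsilon > 0$ is
a predetermined lower bound for the regularization parameters.
For a given $\boldsymbol{\Lambda}_M$, the solution for $\mathbf{g}_M$ can be obtained by setting
the subderivative vector of $L_e$ to zero, i.e. $\partial L_e / \partial \mathbf{g}_M = \mathbf{0}$, yielding

$$g_i^{(\mathrm{olasso})} = \left( |g_i^{(\mathrm{LS})}| - \frac{\lambda_i/2}{\mathbf{w}_i^T \mathbf{w}_i} \right)_+ \mathrm{sign}(g_i^{(\mathrm{LS})}) \qquad (10)$$

for $1 \le i \le M$, with the usual least squares solution given by
$g_i^{(\mathrm{LS})} = \mathbf{w}_i^T \mathbf{y} / (\mathbf{w}_i^T \mathbf{w}_i)$, and the operator

$$(z)_+ = \begin{cases} z, & \text{if } z > 0, \\ 0, & \text{if } z \le 0. \end{cases} \qquad (11)$$

Unlike the LASSO, our objective $L_e(\boldsymbol{\Lambda}_M, \mathbf{g}_M)$ is
constructed on the orthogonal space and the $\ell_1$-norm parameter
constraints are associated with the orthogonal bases $\mathbf{w}_i$,
$1 \le i \le M$. Since the cost function (9) contains a sparsity inducing
norm, some parameters $g_i^{(\mathrm{olasso})}$ will be returned as zeros,
producing a sparse model in the orthogonal space spanned by
the columns of $\mathbf{W}_M$, which corresponds to a sparse model in
the original space spanned by the columns of $\boldsymbol{\Phi}_M$.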
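The soft-thresholding solution (10) acts on each orthogonal basis separately, which a few lines of code make concrete. This is a minimal sketch assuming the cost is the squared residual norm plus the weighted sum of absolute weights; the name `olasso_weights` and the toy numbers are illustrative, not from the paper.

```python
import numpy as np

def olasso_weights(W, y, lam):
    """Soft-thresholded weights for orthogonal columns W and per-term penalties lam."""
    kappa = np.sum(W * W, axis=0)      # w_i^T w_i for every basis
    g_ls = (W.T @ y) / kappa           # least squares weight of every basis
    shrunk = np.maximum(np.abs(g_ls) - lam / (2.0 * kappa), 0.0)
    return shrunk * np.sign(g_ls), g_ls

# Toy orthogonal basis; the numbers are chosen only to make the arithmetic easy.
W = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
y = np.array([3.0, 0.2, 1.0])
g, g_ls = olasso_weights(W, y, lam=np.array([1.0, 1.0]))
```

Here the least squares weights are 1.5 and 0.2; the first is shrunk by 1/(2·4) = 0.125 to 1.375, while the second falls below its threshold of 0.5 and is set to zero, which is how sparsity arises.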


III. Regularization Parameter Optimization and
Model Construction with LOOMSE

Each OFR stage involves the joint regularization parameter
optimization, model term selection and parameter estimation.
The regularization parameters with respect to their associated
candidate regressors are optimized using the approximate
LOOMSE formula that is derived in Section III-B, and the
regressor with the smallest LOOMSE is selected.


A. Model representation and LOOMSE in n-th stage OFR 

Consider the OFR modeling process that has produced
the $n$-term model. Let us denote the constructed $n$ columns
of regressors as $\mathbf{W}_n = [\mathbf{w}_1\ \mathbf{w}_2 \cdots \mathbf{w}_n]$, with
$\mathbf{w}_n = [w_n(1)\ w_n(2) \cdots w_n(N)]^T$. The model output vector of this
$n$-term model is given by

$$\hat{\mathbf{y}}^{(n)} = \sum_{i=1}^{n} g_i^{(\mathrm{olasso})} \mathbf{w}_i, \qquad (12)$$

and the corresponding modeling error vector by $\mathbf{e}^{(n)} = \mathbf{y} - \hat{\mathbf{y}}^{(n)}$.
Clearly, the $n$th OFR stage can be represented by

$$\mathbf{e}^{(n-1)} = g_n \mathbf{w}_n + \mathbf{e}^{(n)}. \qquad (13)$$

The model form (13) illustrates the fact that the $n$th OFR
stage simply fits a one-variable model using the current
model residual produced after the $(n-1)$th stage as the desired
system output. Since $\mathbf{w}_n^T \hat{\mathbf{y}}^{(n-1)} = 0$, it is easy to verify that
$g_n^{(\mathrm{LS})} = \mathbf{w}_n^T \mathbf{y} / (\mathbf{w}_n^T \mathbf{w}_n) = \mathbf{w}_n^T \mathbf{e}^{(n-1)} / (\mathbf{w}_n^T \mathbf{w}_n)$.

The selection of one regressor from the candidate regressors
involves initially generating each candidate $\mathbf{w}_n$ by making the
candidate regressor orthogonal to the $(n-1)$ orthogonal
basis vectors, $\mathbf{w}_i$ for $1 \le i \le n-1$, obtained in the previous
$(n-1)$ OFR stages, followed by evaluating their contributions.

Consider the case of $2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}| > \varepsilon$. Applying (10) to (13),
we note that as $\lambda_n$ decreases away from $2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}|$
towards $\varepsilon$, $g_n^{(\mathrm{olasso})}$ increases its magnitude at a linear rate
with respect to $\lambda_n$, from zero to an upper bound $|g_n^{(B)}|$ with

$$g_n^{(B)} = \left( |g_n^{(\mathrm{LS})}| - \frac{\varepsilon}{2\mathbf{w}_n^T \mathbf{w}_n} \right)_+ \mathrm{sign}(g_n^{(\mathrm{LS})}). \qquad (14)$$

For any candidate regressor, it is vital that we evaluate its
potential model generalization performance using the most
suitable value of $\lambda_n$. The optimization of the LOOMSE with
respect to $\lambda_n$ is detailed in Section III-B based on the idea
of the LOO cross validation outlined below.

Suppose that we sequentially set aside each data point in the
estimation set $D_N$ in turn and estimate a model using the
remaining $(N-1)$ data points. The prediction error is calculated
on the data point that has not been used in estimation. That
is, for $k = 1, \cdots, N$, the models are estimated based on
$D_N \setminus (\mathbf{x}(k), y(k))$, respectively, and the outputs are denoted
as $\hat{y}^{(n,-k)}(k, \lambda_n)$. Then, the LOO prediction error based on
the $k$th data sample is calculated as

$$e^{(n,-k)}(k, \lambda_n) = y(k) - \hat{y}^{(n,-k)}(k, \lambda_n). \qquad (15)$$

The LOOMSE is defined as the average of the squares of all these prediction
errors, given by $J(\lambda_n) = E\big[(e^{(n,-k)}(k, \lambda_n))^2\big]$. Thus the
optimal regularization parameter for the $n$th stage is given by

$$\lambda_n^{\mathrm{opt}} = \arg\min_{\lambda_n} \Big\{ J(\lambda_n) = \frac{1}{N} \sum_{k=1}^{N} \big(e^{(n,-k)}(k, \lambda_n)\big)^2 \Big\}. \qquad (16)$$

Evaluation of $J(\lambda_n)$ by directly splitting the data set
requires extensive computational effort. Instead, we show in
Section III-B that $J(\lambda_n)$ can be approximately calculated
without actually sequentially splitting the estimation data set.
Furthermore, we also show that the optimal value $\lambda_n^{\mathrm{opt}}$ can be
obtained in a closed-form expression.


B. Optimal regularization parameter estimate 

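We notice from (10) that $g_n^{(\mathrm{olasso})} = 0$ if $2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}| < \lambda_n$,
and thus a sufficient condition that a given $\mathbf{w}_n$ may be
excluded from the candidate pool without explicitly determining
$\lambda_n$ is $2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}| < \varepsilon$, where $\varepsilon$, the regularizer's
lower bound, is a preset value that acts as a threshold on the correlation
between the candidate regressor and the residual. Hence, in the following
we assume that $2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}| > \varepsilon$, and we have

$$\mathbf{g}_n^{(\mathrm{olasso})} = \mathbf{H}_n^{-1} \big( \mathbf{W}_n^T \mathbf{y} - \boldsymbol{\Lambda}_n \mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS})})/2 \big), \qquad (17)$$

where $\mathbf{g}_n^{(\mathrm{olasso})} = [g_1^{(\mathrm{olasso})} \cdots g_n^{(\mathrm{olasso})}]^T$,
$\mathrm{sign}(\mathbf{g}_n) = [\mathrm{sign}(g_1)\ \mathrm{sign}(g_2) \cdots \mathrm{sign}(g_n)]^T$,
and $\mathbf{H}_n = \mathbf{W}_n^T \mathbf{W}_n$.
Note that (17) is consistent with (10) for all terms with nonzero
$g_i^{(\mathrm{olasso})}$. In the OFR procedure, any candidate terms $\mathbf{w}_i$ producing
zero $g_i^{(\mathrm{olasso})}$ will not be selected since they will not contribute
to any reduction in the LOOMSE.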
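The consistency between the matrix form and the per-term soft-thresholding rule follows from the orthogonality of the selected columns. A short worked step, written under the assumption that every term falls in the nonzero case of the thresholding rule and that the cost is the squared residual norm plus the weighted absolute weights:

```latex
\mathbf{H}_n = \mathrm{diag}\{\mathbf{w}_1^{T}\mathbf{w}_1, \ldots, \mathbf{w}_n^{T}\mathbf{w}_n\}
\;\Longrightarrow\;
g_i^{(\mathrm{olasso})}
= \frac{\mathbf{w}_i^{T}\mathbf{y} - \lambda_i\,\mathrm{sign}(g_i^{(\mathrm{LS})})/2}{\mathbf{w}_i^{T}\mathbf{w}_i}
= \Big( |g_i^{(\mathrm{LS})}| - \frac{\lambda_i/2}{\mathbf{w}_i^{T}\mathbf{w}_i} \Big)\,\mathrm{sign}(g_i^{(\mathrm{LS})}),
\qquad 1 \le i \le n .
```

The last equality uses $\mathbf{w}_i^{T}\mathbf{y}/(\mathbf{w}_i^{T}\mathbf{w}_i) = g_i^{(\mathrm{LS})} = |g_i^{(\mathrm{LS})}|\,\mathrm{sign}(g_i^{(\mathrm{LS})})$.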

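The model residual is defined by

$$e^{(n)}(k, \lambda_n) = y(k) - \big( \mathbf{y}^T \mathbf{W}_n - (\mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS})}))^T \boldsymbol{\Lambda}_n/2 \big) \mathbf{H}_n^{-1} \mathbf{w}(k), \qquad (18)$$

where $\mathbf{w}(k)$ denotes the transpose of the $k$th row of $\mathbf{W}_n$. If
the data sample indexed at $k$ is removed from the estimation
data set, the LOO parameter estimator obtained by using only
the $(N-1)$ remaining data points is given by

$$\mathbf{g}_n^{(\mathrm{olasso},-k)} = \big(\mathbf{H}_n^{(-k)}\big)^{-1} \big( (\mathbf{W}_n^{(-k)})^T \mathbf{y}^{(-k)} - \boldsymbol{\Lambda}_n \mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS},-k)})/2 \big), \qquad (19)$$

in which $\mathbf{H}_n^{(-k)} = (\mathbf{W}_n^{(-k)})^T \mathbf{W}_n^{(-k)}$, and $\mathbf{W}_n^{(-k)}$ and $\mathbf{y}^{(-k)}$
denote the resultant regression matrix and desired output
vector, respectively. It follows that we have

$$\mathbf{H}_n^{(-k)} = \mathbf{H}_n - \mathbf{w}(k)\mathbf{w}^T(k), \qquad (20)$$

$$(\mathbf{y}^{(-k)})^T \mathbf{W}_n^{(-k)} = \mathbf{y}^T \mathbf{W}_n - y(k)\mathbf{w}^T(k). \qquad (21)$$

The LOO error evaluated at $k$ is given by

$$e^{(n,-k)}(k, \lambda_n) = y(k) - \big( (\mathbf{y}^{(-k)})^T \mathbf{W}_n^{(-k)} - (\mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS},-k)}))^T \boldsymbol{\Lambda}_n/2 \big) \big(\mathbf{H}_n^{(-k)}\big)^{-1} \mathbf{w}(k). \qquad (22)$$

Applying the matrix inversion lemma to (20) yields

$$\big(\mathbf{H}_n^{(-k)}\big)^{-1} = \big(\mathbf{H}_n - \mathbf{w}(k)\mathbf{w}^T(k)\big)^{-1} = \mathbf{H}_n^{-1} + \frac{\mathbf{H}_n^{-1}\mathbf{w}(k)\mathbf{w}^T(k)\mathbf{H}_n^{-1}}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)} \qquad (23)$$

and

$$\big(\mathbf{H}_n^{(-k)}\big)^{-1} \mathbf{w}(k) = \frac{\mathbf{H}_n^{-1}\mathbf{w}(k)}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)}. \qquad (24)$$

Substituting (21) and (24) into (22) yields

$$e^{(n,-k)}(k, \lambda_n) = y(k) - \big( \mathbf{y}^T \mathbf{W}_n - y(k)\mathbf{w}^T(k) - (\mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS},-k)}))^T \boldsymbol{\Lambda}_n/2 \big) \frac{\mathbf{H}_n^{-1}\mathbf{w}(k)}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)}$$
$$= \frac{y(k) - \big( \mathbf{y}^T \mathbf{W}_n - (\mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS},-k)}))^T \boldsymbol{\Lambda}_n/2 \big) \mathbf{H}_n^{-1}\mathbf{w}(k)}{1 - \mathbf{w}^T(k)\mathbf{H}_n^{-1}\mathbf{w}(k)}. \qquad (25)$$

Assuming that $\mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS},-k)}) = \mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS})})$ holds for most
data samples and then applying (18) to (25), we have

$$e^{(n,-k)}(k, \lambda_n) = \gamma_n(k)\, e^{(n)}(k, \lambda_n), \qquad (26)$$

where $\gamma_n(k) = 1 \Big/ \Big( 1 - \sum_{i=1}^{n} \frac{(w_i(k))^2}{\mathbf{w}_i^T \mathbf{w}_i} \Big) > 0$, and $w_i(k)$ is the
$k$th element of $\mathbf{w}_i$. The LOOMSE can then be calculated as

$$J(\lambda_n) = \frac{1}{N} \sum_{k=1}^{N} \gamma_n^2(k) \big(e^{(n)}(k, \lambda_n)\big)^2. \qquad (27)$$

We point out that in order for $\mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS},-k)})$ and $\mathrm{sign}(\mathbf{g}_n^{(\mathrm{LS})})$
to be different, each element in $\mathbf{g}_n^{(\mathrm{LS})}$ needs to be very close to
zero, which is unlikely since only the model terms satisfying
$|\mathbf{w}_n^T \mathbf{e}^{(n-1)}| > \varepsilon/2$ are considered. Hence we can treat $J(\lambda_n)$
given in (27) as the exact LOOMSE for any preset $\varepsilon$ that is
not too small.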
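The relation between the ordinary residual and the LOO error can be verified numerically against an explicit leave-one-out refit. The following sketch rests on the same assumption as the derivation, namely that the sign vector is unchanged when one sample is deleted; the synthetic data, penalty values and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 30, 3
W, _ = np.linalg.qr(rng.normal(size=(N, n)))     # orthonormal columns
W = W * np.array([2.0, 1.5, 3.0])                # rescale; columns stay orthogonal
y = W @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
lam = np.array([0.05, 0.02, 0.03])               # small illustrative penalties

H = W.T @ W
s = np.sign(np.linalg.solve(H, W.T @ y))         # signs of the least squares weights
g = np.linalg.solve(H, W.T @ y - lam * s / 2.0)  # penalised weights, nonzero case
e = y - W @ g                                    # ordinary residuals

# Analytic LOO errors: gamma(k) * e(k), no data splitting required.
gamma = 1.0 / (1.0 - np.sum(W ** 2 / np.diag(H), axis=1))
loo_analytic = gamma * e

# Brute-force LOO errors: refit with sample k deleted, signs held fixed.
loo_brute = np.empty(N)
for k in range(N):
    keep = np.arange(N) != k
    Wk, yk = W[keep], y[keep]
    gk = np.linalg.solve(Wk.T @ Wk, Wk.T @ yk - lam * s / 2.0)
    loo_brute[k] = y[k] - W[k] @ gk

loomse = np.mean(loo_analytic ** 2)
```

The analytic route costs one pass over the data, whereas the brute-force loop solves N separate systems, which is the saving the closed-form LOOMSE is meant to deliver.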

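We further represent (18) as

$$e^{(n)}(k, \lambda_n) = \eta(k) + \mathrm{sign}(g_n^{(\mathrm{LS})}) \frac{\lambda_n}{2} \frac{w_n(k)}{\mathbf{w}_n^T \mathbf{w}_n}, \qquad (28)$$

where $\eta(k) = e^{(n-1)}(k) - g_n^{(\mathrm{LS})} w_n(k)$ is the model residual
obtained based on the least squares estimate at the $n$th
stage. By setting $\partial J(\lambda_n) / \partial \lambda_n = 0$, we obtain $\lambda_n$ in the form of the
weighted least squares estimate

$$\lambda_n = -2\,\mathrm{sign}(g_n^{(\mathrm{LS})})\, \mathbf{w}_n^T \mathbf{w}_n \, \frac{\mathbf{w}_n^T \boldsymbol{\Gamma}^{(n)} \boldsymbol{\eta}}{\mathbf{w}_n^T \boldsymbol{\Gamma}^{(n)} \mathbf{w}_n}, \qquad (29)$$

where $\boldsymbol{\Gamma}^{(n)} = \mathrm{diag}\{\gamma_n^2(1), \gamma_n^2(2), \cdots, \gamma_n^2(N)\}$ and
$\boldsymbol{\eta} = [\eta(1)\ \eta(2) \cdots \eta(N)]^T \in \mathbb{R}^N$. Finally we calculate

$$\lambda_n^{\mathrm{opt}} = \max\Big\{ \min\Big\{ 2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}|,\ -2\,\mathrm{sign}(g_n^{(\mathrm{LS})})\, \mathbf{w}_n^T \mathbf{w}_n \, \frac{\mathbf{w}_n^T \boldsymbol{\Gamma}^{(n)} \boldsymbol{\eta}}{\mathbf{w}_n^T \boldsymbol{\Gamma}^{(n)} \mathbf{w}_n} \Big\},\ \varepsilon \Big\} \qquad (30)$$

in order to satisfy the constraint that $\varepsilon \le \lambda_n \le 2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}|$.
For $\lambda_n^{\mathrm{opt}}$ obtained using (30), we consider the following two
cases:

1) If $\lambda_n^{\mathrm{opt}} = 2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}|$, then $g_n^{(\mathrm{olasso})} = 0$, and this
candidate regressor will not be selected.

2) If $\varepsilon \le \lambda_n^{\mathrm{opt}} < 2|\mathbf{w}_n^T \mathbf{e}^{(n-1)}|$, then calculate $J(\lambda_n^{\mathrm{opt}})$
based on (27) as the LOOMSE for this candidate
regressor.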
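The closed-form regularizer and its clipping to the admissible interval can be sketched for a single candidate as follows. The helper names `optimal_lambda` and `loomse`, the synthetic candidate and the residual are assumptions made for illustration; the squared LOO weights are computed here for a first-stage candidate.

```python
import numpy as np

def optimal_lambda(w, e_prev, gamma_sq, eps):
    """Closed-form regulariser for one candidate w, clipped to [eps, 2|w^T e_prev|]."""
    kappa = w @ w
    alpha = w @ e_prev
    g_ls = alpha / kappa
    eta = e_prev - g_ls * w                      # residual of the least squares fit
    lam_free = (-2.0 * np.sign(g_ls) * kappa
                * (w @ (gamma_sq * eta)) / (w @ (gamma_sq * w)))
    return max(min(2.0 * abs(alpha), lam_free), eps)

def loomse(w, e_prev, gamma_sq, lam):
    """Weighted mean square residual of the one-term fit for a given regulariser."""
    kappa = w @ w
    g_ls = (w @ e_prev) / kappa
    g = max(abs(g_ls) - lam / (2.0 * kappa), 0.0) * np.sign(g_ls)
    e = e_prev - g * w
    return np.mean(gamma_sq * e ** 2)

rng = np.random.default_rng(3)
w = rng.normal(size=40)
e_prev = 0.8 * w + 0.3 * rng.normal(size=40)
gamma_sq = 1.0 / (1.0 - w ** 2 / (w @ w)) ** 2   # squared LOO weights, first stage
lam_opt = optimal_lambda(w, e_prev, gamma_sq, eps=1e-3)
```

Since the residual is linear in the regularizer over the admissible interval, the criterion is a convex quadratic there, so clipping the unconstrained minimizer to the interval gives the constrained minimizer.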
C. Moving unselectable regressors to the inactive set

From Section III-B we noted that a candidate regressor
satisfying $|\mathbf{w}_n^T \mathbf{e}^{(n-1)}| < \varepsilon/2$ does not need to be considered
at the $n$th stage of selection. To save computational cost, we
define the inactive set $\mathcal{S}$ as the index set of the unselectable
regressors removed from the pool of candidates.

In the $n$th OFR stage, all the candidate regressors in
the candidate pool are made orthogonal to the previously
selected $(n-1)$ regressors, and the candidate with the smallest
LOOMSE value is selected as the $n$th model term $\mathbf{w}_n$. Denote
any other candidate regressor as $\mathbf{w}^{(-)}$.

Main Results: If $\|\mathbf{w}^{(-)}\| \cdot \|\mathbf{e}^{(n-1)}\| < \varepsilon/2$, then this candidate
regressor will never be selected in further regression stages,
and hence it can be moved to $\mathcal{S}$.

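Proof: At the $(n+1)$th OFR stage, consider making the
regressor $\mathbf{w}^{(-)}$ orthogonal to $\mathbf{w}_n$, and define

$$\mathbf{w}^{(+)} = \mathbf{w}^{(-)} - \frac{(\mathbf{w}^{(-)})^T \mathbf{w}_n}{\mathbf{w}_n^T \mathbf{w}_n} \mathbf{w}_n. \qquad (31)$$

Clearly,

$$\|\mathbf{w}^{(+)}\|^2 = \Big( \mathbf{w}^{(-)} - \frac{(\mathbf{w}^{(-)})^T \mathbf{w}_n}{\mathbf{w}_n^T \mathbf{w}_n} \mathbf{w}_n \Big)^T \Big( \mathbf{w}^{(-)} - \frac{(\mathbf{w}^{(-)})^T \mathbf{w}_n}{\mathbf{w}_n^T \mathbf{w}_n} \mathbf{w}_n \Big) = \|\mathbf{w}^{(-)}\|^2 - \frac{\big((\mathbf{w}^{(-)})^T \mathbf{w}_n\big)^2}{\mathbf{w}_n^T \mathbf{w}_n} \le \|\mathbf{w}^{(-)}\|^2. \qquad (32)$$

The model residual vector after the selection of $\mathbf{w}_n$ is

$$\mathbf{e}^{(n)} = \mathbf{e}^{(n-1)} - g_n^{(\mathrm{olasso})} \mathbf{w}_n, \qquad (33)$$

where $g_n^{(\mathrm{olasso})}$ can be written as

$$g_n^{(\mathrm{olasso})} = \big( \mathbf{w}_n^T \mathbf{e}^{(n-1)} - \lambda_n \mathrm{sign}(g_n^{(\mathrm{LS})})/2 \big) \big/ \mathbf{w}_n^T \mathbf{w}_n. \qquad (34)$$

Thus we have

$$\|\mathbf{e}^{(n)}\|^2 = \|\mathbf{e}^{(n-1)}\|^2 - 2 g_n^{(\mathrm{olasso})} \mathbf{w}_n^T \mathbf{e}^{(n-1)} + \big(g_n^{(\mathrm{olasso})}\big)^2 \mathbf{w}_n^T \mathbf{w}_n, \qquad (35)$$

$$\big(g_n^{(\mathrm{olasso})}\big)^2 \mathbf{w}_n^T \mathbf{w}_n = \Big[ \big(\mathbf{w}_n^T \mathbf{e}^{(n-1)}\big)^2 + \frac{\lambda_n^2}{4} - \lambda_n \mathrm{sign}(g_n^{(\mathrm{LS})})\, \mathbf{w}_n^T \mathbf{e}^{(n-1)} \Big] \Big/ \mathbf{w}_n^T \mathbf{w}_n, \qquad (36)$$

and

$$2 g_n^{(\mathrm{olasso})} \mathbf{w}_n^T \mathbf{e}^{(n-1)} = \Big( 2 \big(\mathbf{w}_n^T \mathbf{e}^{(n-1)}\big)^2 - \lambda_n \mathrm{sign}(g_n^{(\mathrm{LS})})\, \mathbf{w}_n^T \mathbf{e}^{(n-1)} \Big) \Big/ \mathbf{w}_n^T \mathbf{w}_n. \qquad (37)$$

Substituting (36) and (37) into (35) yields

$$\|\mathbf{e}^{(n)}\|^2 = \|\mathbf{e}^{(n-1)}\|^2 - \frac{\big(\mathbf{w}_n^T \mathbf{e}^{(n-1)}\big)^2 - \lambda_n^2/4}{\mathbf{w}_n^T \mathbf{w}_n} < \|\mathbf{e}^{(n-1)}\|^2 \qquad (38)$$

due to the fact that $|\mathbf{w}_n^T \mathbf{e}^{(n-1)}| > \lambda_n/2$. From (32) and (38), it
can be concluded that

$$\|\mathbf{w}^{(+)}\| \cdot \|\mathbf{e}^{(n)}\| < \|\mathbf{w}^{(-)}\| \cdot \|\mathbf{e}^{(n-1)}\| < \frac{\varepsilon}{2}. \qquad (39)$$

Since $\|\mathbf{w}^{(+)}\| \cdot \|\mathbf{e}^{(n)}\|$ is the upper bound of $|(\mathbf{w}^{(+)})^T \mathbf{e}^{(n)}|$,
this means that this regressor will not be selected at the $(n+1)$th
stage. By induction, it will never be selected in further
regression stages, and hence it can be moved to $\mathcal{S}$.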
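The screening implied by the main result can be sketched as a three-way split of the candidate pool. The function name `screen_candidates` and the toy vectors are hypothetical; the test itself only uses the product of norms, which bounds the absolute correlation by the Cauchy-Schwarz inequality.

```python
import numpy as np

def screen_candidates(candidates, e_prev, eps):
    """Split candidate columns into (inactive, skip_this_stage, evaluate) index lists."""
    alpha = candidates.T @ e_prev                     # correlation with the residual
    # Product of norms: an upper bound on |alpha| that can only shrink in later stages.
    beta = np.linalg.norm(candidates, axis=0) * np.linalg.norm(e_prev)
    idx = range(candidates.shape[1])
    inactive = [j for j in idx if beta[j] < eps / 2]
    skip = [j for j in idx if beta[j] >= eps / 2 and abs(alpha[j]) < eps / 2]
    evaluate = [j for j in idx if abs(alpha[j]) >= eps / 2]
    return inactive, skip, evaluate

e_prev = np.array([1.0, 0.0, 0.0])
candidates = np.array([[1e-3, 0.0, 1.0],
                       [0.0,  1.0, 1.0],
                       [0.0,  0.0, 0.0]])
inactive, skip, evaluate = screen_candidates(candidates, e_prev, eps=0.1)
```

The first column is permanently discarded because its norm bound is below the threshold, the second is only skipped at this stage because its bound is large although its current correlation is zero, and the third proceeds to the LOOMSE evaluation.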


IV. The proposed $\ell_1$-POFR algorithm

The proposed $\ell_1$-POFR algorithm integrates (i) the model
regressor selection based on minimizing the LOOMSE; (ii)
regularization parameter optimization also based on minimizing the
LOOMSE; and (iii) the mechanism of removing unproductive
candidate regressors during the OFR procedure. Define

$$\boldsymbol{\Phi}^{(n-1)} = \big[\mathbf{w}_1 \cdots \mathbf{w}_{n-1}\ \boldsymbol{\phi}_n^{(n-1)} \cdots \boldsymbol{\phi}_M^{(n-1)}\big] \in \mathbb{R}^{N \times M}, \qquad (40)$$

with $\boldsymbol{\Phi}^{(0)} = \boldsymbol{\Phi}_M$. If some of the columns in $\boldsymbol{\Phi}^{(n-1)}$ have
been interchanged, this will still be referred to as $\boldsymbol{\Phi}^{(n-1)}$ for
notational simplicity.

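TABLE I

The $n$th stage of the selection procedure.

For $\{n \le j \le M\} \cap \{j \notin \mathcal{S}\}$, denote the $k$th element of $\boldsymbol{\phi}_j^{(n-1)}$
as $\phi_j^{(n-1)}(k)$, compute $\alpha_j = (\boldsymbol{\phi}_j^{(n-1)})^T \mathbf{e}^{(n-1)}$ and
$\beta_j = \|\boldsymbol{\phi}_j^{(n-1)}\| \cdot \|\mathbf{e}^{(n-1)}\|$.

Step 1): If $\beta_j < \varepsilon/2$, $\mathcal{S} = \mathcal{S} \cup j$. Else if $|\alpha_j| < \varepsilon/2$, set $J_n^{(j)}$ as a very
large positive number so that it will not be selected in Step 4). Otherwise
go to Step 2).

Step 2): Calculate

$$\kappa_n^{(j)} = (\boldsymbol{\phi}_j^{(n-1)})^T \boldsymbol{\phi}_j^{(n-1)}, \qquad (41)$$

$$g_n^{(\mathrm{LS},j)} = \frac{\alpha_j}{\kappa_n^{(j)}}, \qquad (42)$$

$$\boldsymbol{\Gamma}^{(n,j)} = \mathrm{diag}\Big\{ \frac{1}{\big(\zeta^{(n-1)}(1) - (\phi_j^{(n-1)}(1))^2/\kappa_n^{(j)}\big)^2},\ \frac{1}{\big(\zeta^{(n-1)}(2) - (\phi_j^{(n-1)}(2))^2/\kappa_n^{(j)}\big)^2},\ \cdots,\ \frac{1}{\big(\zeta^{(n-1)}(N) - (\phi_j^{(n-1)}(N))^2/\kappa_n^{(j)}\big)^2} \Big\}, \qquad (43)$$

$$\boldsymbol{\eta}^{(j)} = \mathbf{e}^{(n-1)} - g_n^{(\mathrm{LS},j)} \boldsymbol{\phi}_j^{(n-1)}, \qquad (44)$$

$$\lambda_n^{(\mathrm{opt},j)} = \max\Big\{ \min\Big\{ 2|\alpha_j|,\ -2\,\mathrm{sign}(g_n^{(\mathrm{LS},j)})\, \kappa_n^{(j)} \frac{(\boldsymbol{\phi}_j^{(n-1)})^T \boldsymbol{\Gamma}^{(n,j)} \boldsymbol{\eta}^{(j)}}{(\boldsymbol{\phi}_j^{(n-1)})^T \boldsymbol{\Gamma}^{(n,j)} \boldsymbol{\phi}_j^{(n-1)}} \Big\},\ \varepsilon \Big\}. \qquad (45)$$

Step 3): If $\lambda_n^{(\mathrm{opt},j)} = 2|\alpha_j|$, set $J_n^{(j)}$ as a very large positive number so
that it will not be selected in Step 4); otherwise calculate

$$g_n^{(\mathrm{olasso},j)} = \Big( |g_n^{(\mathrm{LS},j)}| - \frac{\lambda_n^{(\mathrm{opt},j)}}{2\kappa_n^{(j)}} \Big)_+ \mathrm{sign}(g_n^{(\mathrm{LS},j)}), \qquad (46)$$

$$\mathbf{e}^{(n,j)} = \mathbf{e}^{(n-1)} - g_n^{(\mathrm{olasso},j)} \boldsymbol{\phi}_j^{(n-1)}, \qquad (47)$$

$$J_n^{(j)} = \frac{1}{N} (\mathbf{e}^{(n,j)})^T \boldsymbol{\Gamma}^{(n,j)} \mathbf{e}^{(n,j)}. \qquad (48)$$

Step 4): Find

$$J_n = J_n^{(j_n)} = \min\big\{ J_n^{(j)},\ \{n \le j \le M\} \cap \{j \notin \mathcal{S}\} \big\}. \qquad (49)$$

Then update $\mathbf{e}^{(n)}$ and $\boldsymbol{\Gamma}^{(n)}$ as $\mathbf{e}^{(n,j_n)}$ and $\boldsymbol{\Gamma}^{(n,j_n)}$, respectively.
The $j_n$th and the $n$th columns of $\boldsymbol{\Phi}^{(n-1)}$ are interchanged, while the $j_n$th
column and the $n$th column of $\mathbf{A}_M$ are interchanged up to the $(n-1)$th
row. This effectively selects the $n$th regressor in the subset model. The
modified Gram-Schmidt orthogonalisation procedure [4] then calculates the
$n$th row of the matrix $\mathbf{A}_M$ and transfers $\boldsymbol{\Phi}^{(n-1)}$ into $\boldsymbol{\Phi}^{(n)}$ as follows

$$\begin{aligned} \mathbf{w}_n &= \boldsymbol{\phi}_n^{(n-1)}, \\ a_{n,j} &= \mathbf{w}_n^T \boldsymbol{\phi}_j^{(n-1)} / \mathbf{w}_n^T \mathbf{w}_n, \quad \{n+1 \le j \le M\} \cap \{j \notin \mathcal{S}\}, \\ \boldsymbol{\phi}_j^{(n)} &= \boldsymbol{\phi}_j^{(n-1)} - a_{n,j} \mathbf{w}_n, \quad \{n+1 \le j \le M\} \cap \{j \notin \mathcal{S}\}. \end{aligned} \qquad (50)$$

Then update $\zeta^{(n)}(k) = \zeta^{(n-1)}(k) - (w_n(k))^2 / \mathbf{w}_n^T \mathbf{w}_n$ for $1 \le k \le N$.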
Fig. 1. Engine Data: (a) the system input $u(k)$, (b) the system output $y(k)$, and (c) the evolution of the size of $\mathcal{S}$ with respect to the chosen $\varepsilon$.


The initial conditions are as follows. Preset $\varepsilon > 0$ as a very
small value. Set $\mathbf{e}^{(0)} = \mathbf{y}$, $\zeta^{(0)}(k) = 1$ for $1 \le k \le N$, and $\mathcal{S}$
as the empty set $\emptyset$. The $n$th stage of the selection procedure
is listed in Table I. The OFR procedure is automatically
terminated at the $(n_s+1)$th stage when the condition

$$J_{n_s+1} \ge J_{n_s} \qquad (51)$$

is detected, yielding a subset model with $n_s$ significant
regressors. It is worth emphasizing that there always exists a model
size $n_s$, and for $n \le n_s$, the LOOMSE $J_n$ decreases as $n$
increases, while the condition (51) holds [14].
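The initial conditions, the per-stage steps and the termination test can be put together in a compact sketch. This is an illustrative reading of the procedure, not the authors' implementation: the function name `l1_pofr` is hypothetical, candidates are orthogonalised in place rather than by swapping columns, the matrix $\mathbf{A}_M$ and the final weights are not tracked, and the synthetic test data are assumptions.

```python
import numpy as np

def l1_pofr(Phi, y, eps=1e-4):
    """Sketch of the forward loop; returns selected column indices and LOOMSE history."""
    N, M = Phi.shape
    P = Phi.astype(float).copy()      # candidate columns, orthogonalised in place
    e = y.astype(float).copy()        # e^(0) = y
    zeta = np.ones(N)                 # zeta^(0)(k) = 1
    active = list(range(M))           # neither selected nor moved to the inactive set
    selected, history = [], []
    while active:
        best = None
        e_norm = np.linalg.norm(e)
        for j in list(active):
            phi = P[:, j]
            kappa = phi @ phi
            alpha = phi @ e
            if np.sqrt(kappa) * e_norm < eps / 2:   # Step 1: move to the inactive set
                active.remove(j)
                continue
            if abs(alpha) < eps / 2:                # not selectable at this stage
                continue
            gam_sq = 1.0 / (zeta - phi ** 2 / kappa) ** 2   # Step 2: squared LOO weights
            g_ls = alpha / kappa
            eta = e - g_ls * phi
            lam = (-2.0 * np.sign(g_ls) * kappa
                   * (phi @ (gam_sq * eta)) / (phi @ (gam_sq * phi)))
            lam = max(min(2.0 * abs(alpha), lam), eps)
            if lam >= 2.0 * abs(alpha):             # Step 3: weight would be zero
                continue
            g = (abs(g_ls) - lam / (2.0 * kappa)) * np.sign(g_ls)
            e_j = e - g * phi
            J = np.mean(gam_sq * e_j ** 2)
            if best is None or J < best[0]:
                best = (J, j, e_j, phi.copy(), kappa)
        # Terminate when no candidate remains or the LOOMSE stops decreasing.
        if best is None or (history and best[0] >= history[-1]):
            break
        J, j, e, w, kappa = best                    # Step 4: accept the winner
        selected.append(j)
        history.append(J)
        active.remove(j)
        zeta = zeta - w ** 2 / kappa
        for i in active:                            # Gram-Schmidt update of the pool
            P[:, i] -= (w @ P[:, i]) / kappa * w
    return selected, history

rng = np.random.default_rng(4)
Phi = rng.normal(size=(60, 10))
y = 3.0 * Phi[:, 2] - 2.0 * Phi[:, 7] + 0.01 * rng.normal(size=60)
selected, history = l1_pofr(Phi, y, eps=1e-4)
```

On this synthetic problem the two columns that generate the output are picked first, and the recorded LOOMSE values decrease until the stopping test fires.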

Note that the LOOMSE is used not only for deriving the
closed form of the optimal regularization parameter estimate
$\lambda_n^{\mathrm{opt}}$ but also for selecting the most significant model regressor.
Specifically, a regressor is selected as the one that produces
the smallest LOOMSE value as well as offering a reduction
in the LOOMSE. After the $n_s$th stage, when there is no reduction
in the LOOMSE criterion for a few consecutive OFR stages,
the model construction procedure can be terminated. Thus, the
$\ell_1$-POFR algorithm automatically constructs a sparse $n_s$-term
model, where typically $n_s \ll M$.

Also note that it is assumed that $\varepsilon$ should not be too
small such that the LOOMSE estimation formula can be
considered to be accurate. This means that if $\varepsilon$ is set too low,
many insignificant candidate regressors will have inaccurate
LOOMSE values for competition. However, we emphasize
that these terms with inaccurate LOOMSE values will not be
selected as the winner to enter the model. Hence in practice
we only need to make sure that $\varepsilon$ is not too large, which would
introduce unnecessary bias to the model parameter estimates.
Clearly, a relatively larger $\varepsilon$ will save computational costs by
1) resulting in a sparser model, and 2) producing a larger sized
inactive set during the OFR process.

Finally, regarding the computational complexity of the
$\ell_1$-POFR algorithm, if the unproductive regressors are not
removed to the inactive set $\mathcal{S}$ during the OFR procedure, it
is well known that the computational cost is in the order of
$O(N)$ for evaluating each candidate regressor. The total
computational cost then needs to be scaled by the number of
evaluations in forward regression, which is $M(M - n_s)/2$.
By removing unproductive regressors to $\mathcal{S}$ during the OFR
procedure, the number of candidate evaluations, and hence the
computational cost, is reduced. It is not possible to assess the
computational cost saving exactly in advance,
as this is problem dependent.


V. Simulation Study 

Example 1: This Engine Data set contains the 410
data samples of the fuel rack position (the input $u(k)$) and
the engine speed (the output $y(k)$), collected from a Leyland
TL11 turbocharged, direct injection diesel engine which was
operated at a low engine speed. The 410 input and output data
points of the engine data set are plotted in Fig. 1(a) and (b),
respectively. The first 210 data samples were used in training and
the last 200 data samples for model testing. The previous study
has shown that the data set can be modeled adequately using
the system input vector $\mathbf{x}(k) = [y(k-1)\ u(k-1)\ u(k-2)]^T$,
and the best Gaussian RBF model was provided by the $\ell_2$-norm
local regularization assisted OLS (LROLS) algorithm
based on the LOOMSE (LROLS-LOO) [14], which is quoted
in Table II for comparison. The $\varepsilon$-SVM algorithm [16] and
the LASSO were also tested using the Gaussian
kernel with a common variance $\tau^2$. For the $\varepsilon$-SVM, the Matlab
function quadprog.m was used with the algorithm option set as
'interior-point-convex'. The tuning parameters in the $\varepsilon$-SVM
algorithm, such as the soft margin parameter $C$ [16], were set
empirically so that the best possible result was obtained after
several trials. For the LASSO, the Matlab function lasso.m
was used with 10-fold CV being used to select the associated
regularization parameter. For both the $\varepsilon$-SVM and LASSO, we
list the results obtained for a range of kernel width $\tau$ values
in Table II for comparison.

TABLE II

Comparison of the modeling performance for Engine Data. The
computational cost saving is based on the same size of model
without removing unproductive regressors in the $\ell_1$-POFR.

Algorithm                                      MSE training set   MSE test set   Model size   Cost saving
LROLS-LOO [14]                                 0.000453           0.000490       22           NA
$\varepsilon$-SVM ($\tau = 3$)                 0.000502           0.000482       208          NA
$\varepsilon$-SVM ($\tau = 2.5$)               0.000480           0.000475       208          NA
$\varepsilon$-SVM ($\tau = 2$)                 0.000461           0.000486       208          NA
$\varepsilon$-SVM ($\tau = 1.5$)               0.000415           0.000579       208          NA
$\varepsilon$-SVM ($\tau = 1$)                 0.000370           0.000794       208          NA
LASSO ($\tau = 1.5$)                           0.000923           0.001010       70           NA
LASSO ($\tau = 1$)                             0.000708           0.000748       44           NA
LASSO ($\tau = 0.5$)                           0.000706           0.000842       54           NA
LASSO ($\tau = 0.2$)                           0.000565           0.000800       81           NA
LASSO ($\tau = 0.1$)                           0.000644           0.001907       76           NA
$\ell_1$-POFR ($\varepsilon = 10^{-?}$)        0.000498           0.000502       20           27%
$\ell_1$-POFR ($\varepsilon = 10^{-?}$)        0.000492           0.000480       20           18%
$\ell_1$-POFR ($\varepsilon = 10^{-?}$)        0.000484           0.000485       20           8%
$\ell_1$-POFR ($\varepsilon = 10^{-?}$)        0.000481           0.000476       20           3%
$\ell_1$-POFR ($\varepsilon = 0$)              0.000452           0.000472       21           0%

Similar to the LROLS-LOO algorithm [14], we also used the
Gaussian RBF kernel (3) for the proposed $\ell_1$-POFR algorithm
with an empirically set $\tau = 2.5$, and the RBF centers $\mathbf{c}_i$ were
formed using all the training data samples. With a preset value
of $\varepsilon$, a sparse model of size $n_s$ was automatically selected
when the condition (51) was met. Fig. 1(c) illustrates the
evolution of the size of $\mathcal{S}$ with respect to a range of the preset
$\varepsilon$ values. The test MSE values produced by the sparse models
and the sizes of the models associated with the same range of $\varepsilon$
values are recorded in Table II, which show the excellent
model generalization capability of all the models generated
by the proposed algorithm. Moreover, the $\ell_1$-POFR algorithm
produces the sparsest model.

Example 2: This regression benchmark data set, Boston
Housing Data, is available at the UCI repository [17]. The
data set comprises 506 data points with 14 variables. The
previous study [18] performed the task of predicting the
median house value from the remaining 13 attributes using
the $\varepsilon$-SVM [16], the LROLS-LOO [14] and the nonlinear OFR
based on the LOOMSE (NonOFR-LOO) [18]. The NonOFR-LOO
algorithm [18] constructs a nonlinear RBF model in the
OFR procedure, where each stage of the OFR determines one
RBF node's center vector and diagonal covariance matrix by
minimizing the LOOMSE. In the experimental study presented
in [18], 456 data points were randomly selected from the
data set for training and the remaining 50 data points were
used to form the test set. Average results were given over
100 realizations. For each realization, the 13 input attributes were
normalized so that each attribute had zero mean and a standard
deviation of one. We also experimented with the LASSO
supplied by Matlab lasso.m with the option set as 10-fold CV to
select the associated regularization parameter. For the LASSO,
a common kernel width $\tau$ was set for constructing the kernel
model from the 456 candidate regressors of each realization,
and a range of $\tau$ values were tested.

For the $\ell_1$-POFR, $\tau = 15$ was empirically set for constructing
456 candidate Gaussian RBF regressors of each realization.
We tested a range of the preset $\varepsilon$ values for the
$\ell_1$-POFR algorithm, and the results obtained are summarized
in Table III, in comparison with the results obtained by the
$\varepsilon$-SVM and the LASSO, as well as the LROLS-LOO and
NonOFR-LOO, which are quoted from the study [18].

TABLE III

Comparison of the modeling performance for Boston Housing
Data. The results were averaged over 100 realizations and
given as mean ± standard deviation.

Algorithm                                    MSE training set   MSE test set    Model size
$\varepsilon$-SVM [16]                       6.80 ± 0.44        23.18 ± 9.05    243 ± 5.3
LROLS-LOO [14]                               12.97 ± 2.67       17.42 ± 4.67    58.6 ± 11.3
NonOFR-LOO [18]                              10.10 ± 3.40       14.07 ± 3.62    34.6 ± 8.4
LASSO ($\tau = 2$)                           8.52 ± 3.57        14.37 ± 8.15    76.8 ± 39.7
LASSO ($\tau = 3$)                           8.55 ± 1.07        13.31 ± 6.65    68.6 ± 29.3
LASSO ($\tau = 5$)                           10.45 ± 1.07       15.05 ± 8.37    85.9 ± 19.7
LASSO ($\tau = 10$)                          16.42 ± 1.78       19.39 ± 8.31    29.9 ± 21.3
$\ell_1$-POFR ($\varepsilon = 0.01$)         9.99 ± 1.37        14.47 ± 7.47    30.5 ± 5.3
$\ell_1$-POFR ($\varepsilon = 0.001$)        9.24 ± 1.57        14.10 ± 7.02    34.9 ± 7.8
$\ell_1$-POFR ($\varepsilon = 0.0001$)       9.07 ± 1.64        14.02 ± 6.85    36.6 ± 9.3
$\ell_1$-POFR ($\varepsilon = 0.00001$)      9.08 ± 1.64        13.95 ± 6.76    36.5 ± 9.3


VI. Conclusions 

We have developed an efficient data modeling algorithm,
referred to as the $\ell_1$-norm penalized orthogonal forward regression
($\ell_1$-POFR), for linear-in-the-parameters nonlinear models
based on a new $\ell_1$-norm penalized cost function defined in
the constructed orthogonal modeling space. The LOOMSE is
used for simultaneous model term selection and regularization
parameter estimation in a highly efficient OFR procedure.
Additionally, we have proposed a lower bound of the
regularization parameters for robust LOOMSE estimation as well as
detecting and removing insignificant regressors to an inactive
set along the OFR process, further enhancing the efficiency of
the OFR procedure. Numerical studies have been utilized to
demonstrate the effectiveness of this new $\ell_1$-POFR approach.